An error is better than a surprise

Recall our both_na() function from Chapter 2, which counts the entries where the vectors x and y both have missing values:

both_na <- function(x, y) {
  sum(is.na(x) & is.na(y))
}

We had an example where the behavior was a little surprising:

x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)
both_na(x, y)
## Warning in is.na(x) & is.na(y): longer object length is not a
##       multiple of shorter object length
## [1] 3

The function runs and returns 3 (thanks to recycling), but we certainly didn’t design it with the idea that people could pass in arguments of different lengths.

Using stopifnot() is a quick way to have your function stop if a condition isn’t met. stopifnot() takes logical expressions as arguments, and if any evaluate to FALSE, it throws an error.

# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  # Add stopifnot() to check length of x and y
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

# Call both_na() on x and y
#both_na(x, y)
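With the check in place, a mismatched call now fails instead of silently recycling. A quick sketch (capturing the error with tryCatch() so the script keeps running):

```r
both_na <- function(x, y) {
  # Stop unless x and y have the same length
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

# Equal lengths still work as before
both_na(c(NA, 1, NA), c(NA, NA, 2))
## [1] 1

# Mismatched lengths now raise an error rather than a warning
inherits(tryCatch(both_na(c(NA, NA, NA), c(1, NA, NA, NA)),
                  error = function(e) e), "error")
## [1] TRUE
```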

An informative error is even better

Using stop() instead of stopifnot() allows you to specify a more informative error message. Recall the general pattern for using stop() is:

if (condition) {
  stop("Error", call. = FALSE)
}

Writing good error messages is an important part of writing a good function! We recommend that your error message tell the user what should be true, not what is false. For example, a good message here is “x and y must have the same length”, rather than the bad message “x and y don’t have the same length”.

Let’s use this pattern to write a better check for the length of x and y.

# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  # Replace condition with logical
  if (length(x) != length(y)) {
    # Replace "Error" with better message
    stop("x and y must have the same length", call. = FALSE)
  }  
  sum(is.na(x) & is.na(y))
}

# Call both_na() 
#both_na(x, y)
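Capturing the condition shows the message users will actually see on failure (a small sketch using tryCatch()):

```r
both_na <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }
  sum(is.na(x) & is.na(y))
}

# The informative message is exactly what the user sees on failure
tryCatch(both_na(c(NA, NA, NA), c(1, NA, NA, NA)),
         error = conditionMessage)
## [1] "x and y must have the same length"
```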

A different kind of surprise: side effects

Side effects describe the things that happen when you run a function that alters the state of your R session. If foo() is a function with no side effects (a.k.a. pure), then when we run x <- foo(), the only change we expect is that the variable x now has a new value. No other variables in the global environment should be changed or created, no output should be printed, no plots displayed, no files saved, no options changed. We know exactly the changes to the state of the session just by reading the call to the function.

Can you identify which of these functions doesn’t have side effects?

show_missings <- function(x) {
  n <- sum(is.na(x))
  cat("Missing values: ", n, "\n", sep = "")
  x
}

replace_missings <- function(x, replacement) {
  x[is.na(x)] <- replacement
  x
}

plot_missings <- function(x) {
  plot(seq_along(x), is.na(x))
  x
}

exclude_missings <- function() {
  options(na.action = "na.exclude")
}

Answer: replace_missings(). It returns a modified copy of its input and changes nothing else in the session.
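One way to convince yourself that replace_missings() is pure: R’s copy-on-modify semantics mean the caller’s vector is left untouched. A quick sketch:

```r
replace_missings <- function(x, replacement) {
  x[is.na(x)] <- replacement
  x
}

v <- c(1, NA, 3)
w <- replace_missings(v, 0)

w   # the returned copy has the NA replaced
## [1] 1 0 3
v   # the original vector is unchanged
## [1]  1 NA  3
```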

Of course, functions with side effects are crucial for data analysis. You need to be aware of them and deliberate in their use: it’s fine to use them when the side effect is desired, but don’t surprise users with unexpected ones.

sapply is another common culprit

sapply() is another common offender when it comes to type-inconsistent returns: the type of its output depends on the type of its input.

Consider the following data frame and two calls to sapply():

df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)

A <- sapply(df[1:4], class) 
B <- sapply(df[3:4], class)

str(A)
## List of 4
##  $ a: chr "integer"
##  $ b: chr "numeric"
##  $ y: chr [1:2] "POSIXct" "POSIXt"
##  $ z: chr [1:2] "ordered" "factor"
str(B)
##  chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "y" "z"

What type of objects will A and B be?

A will be a list, B will be a character matrix.

Using purrr solves the problem

This unpredictable behavior is a sign that you shouldn’t rely on sapply() inside your own functions.

So, what do you do? Use alternate functions that are type consistent! And you already know a whole set: the map() functions in purrr.

In this example, when we call class() on the columns of the data frame we expect character output, so our function of choice is map_chr():

library(purrr)

df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)

A <- map_chr(df[1:4], class) 
B <- map_chr(df[3:4], class)

str(A)
str(B)

Except that gives us an error: Error: Result 3 must be a single string, not a character vector of length 2. This is a good thing! It alerts us that our assumption (that class() would always return a single string) was wrong.

Let’s look at a couple of solutions. First, we could use map() instead of map_chr(). Our result will always be a list, no matter the input.

library(purrr)

# sapply calls
A <- sapply(df[1:4], class) 
B <- sapply(df[3:4], class)
C <- sapply(df[1:2], class) 

# Demonstrate type inconsistency
str(A)
## List of 4
##  $ a: chr "integer"
##  $ b: chr "numeric"
##  $ y: chr [1:2] "POSIXct" "POSIXt"
##  $ z: chr [1:2] "ordered" "factor"
str(B)
##  chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "y" "z"
str(C)
##  Named chr [1:2] "integer" "numeric"
##  - attr(*, "names")= chr [1:2] "a" "b"
# Use map() to define X, Y and Z
X <- map(df[1:4], class)
Y <- map(df[3:4], class)
Z <- map(df[1:2], class)

# Use str() to check type consistency
str(X)
## List of 4
##  $ a: chr "integer"
##  $ b: chr "numeric"
##  $ y: chr [1:2] "POSIXct" "POSIXt"
##  $ z: chr [1:2] "ordered" "factor"
str(Y)
## List of 2
##  $ y: chr [1:2] "POSIXct" "POSIXt"
##  $ z: chr [1:2] "ordered" "factor"
str(Z)
## List of 2
##  $ a: chr "integer"
##  $ b: chr "numeric"

A type consistent solution

If we wrap our solution into a function, we can be confident that this function will always return a list because we’ve used a type consistent function, map():

col_classes <- function(df) {
  map(df, class)
}

But what if you wanted this function to always return a character string?

One option would be to decide what should happen if class() returns something longer than length 1. For example, we might simply take the first element of the vector returned by class().

col_classes <- function(df) {
  # Assign list output to class_list
  class_list <- map(df, class)
  
  # Use map_chr() to extract first element in class_list
  map_chr(class_list, 1)
}

# Check that our new function is type consistent
df %>% col_classes() %>% str()
##  Named chr [1:4] "integer" "numeric" "POSIXct" "ordered"
##  - attr(*, "names")= chr [1:4] "a" "b" "y" "z"
df[3:4] %>% col_classes() %>% str()
##  Named chr [1:2] "POSIXct" "ordered"
##  - attr(*, "names")= chr [1:2] "y" "z"
df[1:2] %>% col_classes() %>% str()
##  Named chr [1:2] "integer" "numeric"
##  - attr(*, "names")= chr [1:2] "a" "b"

Or fail early if something goes wrong

Another option would be to simply fail. We could rely on map_chr()’s type consistency to fail for us:

col_classes <- function(df) {
  map_chr(df, class)
}

df %>% col_classes() %>% str()

Or, check the condition ourselves and return an informative error message. We’ll implement this approach in this exercise.

As you write more functions, you’ll often come across this tension: should a function do something sensible when something surprising happens, or simply fail? Our recommendation is to fail when writing functions that you’ll use behind the scenes for programming, and to do something sensible when writing functions for users to use interactively.

(And by the way, flatten_chr() is yet another useful function in purrr. It takes a list and removes its hierarchy, flattening it into a vector. The suffix _chr indicates that it is another type-consistent function: it will either return a character vector or throw an error.)

col_classes <- function(df) {
  class_list <- map(df, class)
  
  # Add a check that no element of class_list has length > 1
  if (any(map_dbl(class_list, length) > 1)) {
    stop("Some columns have more than one class", call. = FALSE)
  }
  
  # Use flatten_chr() to return a character vector
  flatten_chr(class_list)
}

# df and df[3:4] now trigger our error (y and z have two classes each);
# df[1:2] still returns a character vector
#df %>% col_classes() %>% str()
#df[3:4] %>% col_classes() %>% str()
#df[1:2] %>% col_classes() %>% str()
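To see the check in action without stopping the script, we can capture the error with tryCatch(). This sketch assumes the df defined earlier, with the multi-class y and z columns:

```r
library(purrr)

col_classes <- function(df) {
  class_list <- map(df, class)
  if (any(map_dbl(class_list, length) > 1)) {
    stop("Some columns have more than one class", call. = FALSE)
  }
  flatten_chr(class_list)
}

df <- data.frame(a = 1L, b = 1.5, y = Sys.time(), z = ordered(1))

# Single-class columns pass through as a character vector
col_classes(df[1:2])

# Multi-class columns (y, z) trigger our informative error
tryCatch(col_classes(df), error = conditionMessage)
## [1] "Some columns have more than one class"
```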

Programming with NSE functions

Let’s take a look at a function that uses the non-standard evaluation (NSE) function filter() from the dplyr package:

big_x <- function(df, threshold) {
  dplyr::filter(df, x > threshold)
}

This big_x() function attempts to return all rows in df where the x column exceeds a certain threshold. Let’s get a feel for how it might be used.

diamonds_sub <- diamonds[1:1000,]

# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub,  7)
## # A tibble: 1 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  1.27 Premium H     SI2      59.3    61  2845  7.12  7.05   4.2

When things go wrong

Now, let’s see how this function might fail. There are two instances in which the non-standard evaluation of filter() could cause surprising results:

The x column doesn’t exist in df.
There is a threshold column in df.

Let’s illustrate these failures. In each case we’ll use big_x() in the same way as the previous exercise, so we should expect the same output. However, not only do we get unexpected output, there is also no indication (no error or warning) that something has gone wrong.

# Remove the x column from diamonds
diamonds_sub$x <- NULL

# Create variable x with value 1
x <- 1

# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)
## # A tibble: 0 x 9
## # ... with 9 variables: carat <dbl>, cut <ord>, color <ord>,
## #   clarity <ord>, depth <dbl>, table <dbl>, price <int>, y <dbl>, z <dbl>
# Create a threshold column with value 100
diamonds_sub$threshold <- 100

# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7) 
## # A tibble: 0 x 10
## # ... with 10 variables: carat <dbl>, cut <ord>, color <ord>,
## #   clarity <ord>, depth <dbl>, table <dbl>, price <int>, y <dbl>,
## #   z <dbl>, threshold <dbl>

Instead of failing with an error or warning, big_x() gave an incorrect answer. This is dangerous!

What to do?

To avoid the problems caused by non-standard evaluation functions, you could avoid using them. In our example, we could achieve the same results by using standard subsetting (i.e. []) instead of filter(). For more insight into dealing with NSE and how to write your own non-standard evaluation functions, we recommend reading Hadley’s vignette on the topic. Also, programming with the NSE functions in dplyr will be easier in a future version.
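For instance, here is a sketch of big_x() using standard subsetting only. df[["x"]] only ever looks inside df (never at a global x), and threshold always refers to the function argument, so both problem cases disappear; we still check for a missing x column, since df[["x"]] would otherwise return NULL and silently drop every row.

```r
big_x <- function(df, threshold) {
  # Fail explicitly if x is missing instead of silently matching a global x
  if (!"x" %in% names(df)) {
    stop("df must contain variable called x", call. = FALSE)
  }
  # [[ looks up the column in df only; threshold is always the argument
  df[df[["x"]] > threshold, , drop = FALSE]
}

# A threshold column in df no longer interferes
d <- data.frame(x = c(6, 8, 9), threshold = 100)
nrow(big_x(d, 7))
## [1] 2
```

One remaining difference from filter(): rows where x is NA are kept (as all-NA rows) by [, whereas filter() drops them, so handle missing values explicitly if that matters for your data.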

If you do need to use non-standard evaluation functions, it’s up to you to provide protection against the problem cases. That means you need to know what the problem cases are, to check for them, and to fail explicitly.

To see what that might look like, let’s rewrite big_x() to fail for our problem cases.

big_x <- function(df, threshold) {
  # Write a check for x not being in df
  if (!"x" %in% colnames(df)) {
    stop("df must contain variable called x", call. = FALSE)
  }
  
  # Write a check for threshold being in df
  if ("threshold" %in% colnames(df)) {
    stop("df must not contain variable called threshold", call. = FALSE)
  }
  
  dplyr::filter(df, x > threshold)
}

A hidden dependence

A classic example of a hidden dependence is the stringsAsFactors argument to the read.csv() function (and a few other data frame functions).

When you see the following code, you don’t know exactly what the result will be:

pools <- read.csv("../xDatasets/swimming_pools.csv")

That’s because if the argument stringsAsFactors isn’t specified, it inherits its value from getOption("stringsAsFactors"), a global option that a user may change.

Just to prove that this is the case, let’s illustrate the problem.

# This is the default behavior
options(stringsAsFactors = TRUE)

# Read in the swimming_pools.csv to pools
pools <- read.csv("../xDatasets/swimming_pools.csv")

# Examine the structure of pools
str(pools)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
##  $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...
# Override the global option by passing stringsAsFactors = FALSE directly
read.csv("../xDatasets/swimming_pools.csv", stringsAsFactors = FALSE)
##                                         Name
## 1                Acacia Ridge Leisure Centre
## 2                            Bellbowrie Pool
## 3                                Carole Park
## 4                Centenary Pool (inner City)
## 5                             Chermside Pool
## 6                Colmslie Pool (Morningside)
## 7             Spring Hill Baths (inner City)
## 8                 Dunlop Park Pool (Corinda)
## 9                      Fortitude Valley Pool
## 10 Hibiscus Sports Complex (upper MtGravatt)
## 11                 Ithaca Pool ( Paddington)
## 12                             Jindalee Pool
## 13                                Manly Pool
## 14            Mt Gravatt East Aquatic Centre
## 15       Musgrave Park Pool (South Brisbane)
## 16                            Newmarket Pool
## 17                              Runcorn Pool
## 18                             Sandgate Pool
## 19      Langlands Parks Pool (Stones Corner)
## 20                         Yeronga Park Pool
##                                        Address  Latitude Longitude
## 1           1391 Beaudesert Road, Acacia Ridge -27.58616  153.0264
## 2                 Sugarwood Street, Bellbowrie -27.56547  152.8911
## 3   Cnr Boundary Road and Waterford Road Wacol -27.60744  152.9315
## 4             400 Gregory Terrace, Spring Hill -27.45537  153.0251
## 5                 375 Hamilton Road, Chermside -27.38583  153.0351
## 6                 400 Lytton Road, Morningside -27.45516  153.0789
## 7             14 Torrington Street, Springhill -27.45960  153.0215
## 8                      794 Oxley Road, Corinda -27.54652  152.9806
## 9         432 Wickham Street, Fortitude Valley -27.45390  153.0368
## 10         90 Klumpp Road, Upper Mount Gravatt -27.55183  153.0735
## 11               131 Caxton Street, Paddington -27.46226  153.0103
## 12                 11 Yallambee Road, Jindalee -27.53236  152.9427
## 13                  1 Fairlead Crescent, Manly -27.45228  153.1874
## 14 Cnr wecker Road and Newnham Road, Mansfield -27.53214  153.0943
## 15       100 Edmonstone Street, South Brisbane -27.47978  153.0168
## 16                71 Alderson Stret, Newmarket -27.42968  153.0062
## 17                   37 Bonemill Road, Runcorn -27.59156  153.0764
## 18               231 Flinders Parade, Sandgate -27.31196  153.0691
## 19             5 Panitya Street, Stones Corner -27.49769  153.0487
## 20                     81 School Road, Yeronga -27.52053  153.0185
# Read in the swimming_pools.csv to pools2
pools2 <- read.csv("../xDatasets/swimming_pools.csv", stringsAsFactors = FALSE)

# Examine the structure of pools2
str(pools2)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

Legitimate use of options

In general, you want to avoid having the return value of your own functions depend on any global options. That way, you and others can reason about your functions without needing to know the current state of the options.

It is, however, okay to have side effects of a function depend on global options. For example, the print() function uses getOption("digits") as the default for the digits argument. This gives users some control over how results are displayed, but doesn’t change the underlying computation.
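A minimal illustration: changing the digits option alters only what print() displays, not the stored value.

```r
x <- pi

options(digits = 7)
print(x)
## [1] 3.141593

options(digits = 3)
print(x)
## [1] 3.14

# Only the display changed; the underlying value is untouched
x == pi
## [1] TRUE
```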

Let’s take a look at an example function that uses a global default sensibly. The print.lm() method has a digits argument with default max(3, getOption("digits") - 3).

# Start with this
options(digits = 8)

# Fit a regression model
fit <- lm(mpg ~ wt, data = mtcars)

# Look at the summary of the model
summary(fit)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5431 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858 < 2.2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.294e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.0459 on 30 degrees of freedom
## Multiple R-squared:  0.75283,    Adjusted R-squared:  0.74459 
## F-statistic: 91.375 on 1 and 30 DF,  p-value: 1.294e-10
# Set the global digits option to 2
options(digits = 2)

# Take another look at the summary
summary(fit)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.543 -2.365 -0.125  1.410  6.873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.285      1.878   19.86  < 2e-16 ***
## wt            -5.344      0.559   -9.56  1.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3 on 30 degrees of freedom
## Multiple R-squared:  0.753,  Adjusted R-squared:  0.745 
## F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10

Session Info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] purrr_0.3.0      ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0 
## [5] kableExtra_1.0.1 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        plyr_1.8.4        pillar_1.3.1     
##  [4] compiler_3.5.2    prettydoc_0.2.1   tools_3.5.2      
##  [7] digest_0.6.18     gtable_0.2.0      evaluate_0.12    
## [10] tibble_2.0.1      viridisLite_0.3.0 pkgconfig_2.0.2  
## [13] rlang_0.3.1       cli_1.0.1         rstudioapi_0.9.0 
## [16] yaml_2.2.0        xfun_0.4          withr_2.1.2      
## [19] httr_1.4.0        stringr_1.4.0     xml2_1.2.0       
## [22] hms_0.4.2         webshot_0.5.1     grid_3.5.2       
## [25] tidyselect_0.2.5  glue_1.3.0        R6_2.4.0         
## [28] fansi_0.4.0       rmarkdown_1.11    readr_1.3.1      
## [31] magrittr_1.5      scales_1.0.0      htmltools_0.3.6  
## [34] assertthat_0.2.0  rvest_0.3.2       colorspace_1.4-0 
## [37] utf8_1.1.4        stringi_1.3.1     lazyeval_0.2.1   
## [40] munsell_0.5.0     crayon_1.3.4